servstate: remove on-check-failure test races #272

flotter · 2023-08-02T11:24:26Z

This patch improves the following tests, all related to checker functionality in servstate observed when running under stress:

TestOnCheckFailureRestartWhileRunning
TestOnCheckFailureRestartDuringBackoff
TestOnCheckFailureIgnore
TestOnCheckFailureShutdown

Since we are unit testing servstate, and not checkstate, this patch adds code to more effectively control (mock) when the check failure actions are delivered during the test. This removes the requirement for tests to rely on finely tuned delays to check various service life-cycle states.

The following changes are made:

Round some of the delays to multiples of 100ms to highlight the fact these values are less important now. Delays intended to present infinite are replaced with 10s to ensure severe failures are observed if the test does not work as intended.
Remove iteration code to wait for a checker action, and replace with channel sync code.
Add some additional function comments to explain what the tests aim to confirm.

This patch improves the following tests, all related to checker functionality in servstate observed when running under stress: TestOnCheckFailureRestartWhileRunning TestOnCheckFailureRestartDuringBackoff TestOnCheckFailureIgnore TestOnCheckFailureShutdown Since we are unit testing servstate, and not checkstate, this patch adds code to more effectively control (mock) when the check failure actions are delivered during the test. This removes the requirement for tests to rely on finely tuned delays to check various service life-cycle states. The following changes are made: - Round some of the delays to multiples of 100ms to highlight the fact these values are less important now. Delays intended to present infinite are replaced with 10s to ensure severe failures are observed if the test does not work as intended. - Remove iteration code to wait for a checker action, and replace with channel sync code. - Add some additional function comments to explain what the tests aim to confirm.

internals/overlord/servstate/manager_test.go

benhoyt · 2023-08-02T23:00:41Z

Thanks for this! Definitely a nice incremental improvement. I updated the PR to fix the nits I mentioned so I can merge right away.

barrettj12 · 2023-08-03T00:26:47Z

@flotter synchronising on a channel seems like a good idea. Nonetheless, I am still observing intermittent failures in the following tests:

TestOnCheckFailureRestartWhileRunning

$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureRestartWhileRunning -count=1 -race
2023-08-03T00:16:46.601Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3400122428/1/out; sync; touch /tmp/check-3400122428/0/test2; sleep 10'
2023-08-03T00:16:46.702Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.702Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:16:46.928Z [test] Service "test2" on-check-failure action is "restart", terminating process before restarting
2023-08-03T00:16:46.929Z [test] Service "test2" exited after check failure, restarting
2023-08-03T00:16:46.929Z [test] Service "test2" on-check-failure action is "restart", waiting ~100ms before restart (backoff 1)
2023-08-03T00:16:46.929Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.971Z [test] Service "test2" stopped while waiting for backoff

----------------------------------------------------------------------
FAIL: manager_test.go:984: S.TestOnCheckFailureRestartWhileRunning

manager_test.go:1044:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.40s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.458s
FAIL

TestOnCheckFailureIgnore

$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureIgnore -count=1 -race
2023-08-03T00:18:28.213Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3858153059/1/out; sync; touch /tmp/check-3858153059/0/test2; sleep 10'
2023-08-03T00:18:28.313Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.313Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:18:28.444Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.513Z [test] Check "chk1" failure 3 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.544Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1167: S.TestOnCheckFailureIgnore

manager_test.go:1226:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.34s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.374s
FAIL

TestOnCheckFailureShutdown

$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureShutdown -count=1 -race
2023-08-03T00:19:01.028Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-1520139323/1/out; sync; touch /tmp/check-1520139323/0/test2; sleep 10'
2023-08-03T00:19:01.173Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.173Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:19:01.546Z [test] Service "test2" on-check-failure action is "shutdown", triggering server exit
2023-08-03T00:19:01.546Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.604Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1243: S.TestOnCheckFailureShutdown

manager_test.go:1302:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.60s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.628s
FAIL

flotter · 2023-08-03T08:39:18Z

@flotter synchronising on a channel seems like a good idea. Nonetheless, I am still observing intermittent failures in the following tests:

TestOnCheckFailureRestartWhileRunning
TestOnCheckFailureIgnore
TestOnCheckFailureShutdown

Sorry about this Jordan. I feel embarrassed, but will fix that asap.

barrettj12 · 2023-08-04T00:58:51Z

No need to be embarrassed, you're doing more than the rest of us ;)

flotter requested a review from barrettj12 August 2, 2023 11:24

flotter mentioned this pull request Aug 2, 2023

Intermittent test failures: servstate.TestOnCheckFailure* #270

Closed

jnsgruk linked an issue Aug 2, 2023 that may be closed by this pull request

Intermittent test failures: servstate.TestOnCheckFailure* #270

Closed

jnsgruk requested a review from benhoyt August 2, 2023 12:01

flotter force-pushed the servstate-stress-fixes branch from 671d52b to 4cd9ab4 Compare August 2, 2023 12:14

Fix review nits

6a43869

benhoyt approved these changes Aug 2, 2023

View reviewed changes

internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved

internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved

benhoyt merged commit f593810 into canonical:master Aug 2, 2023
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

servstate: remove on-check-failure test races #272

servstate: remove on-check-failure test races #272

flotter commented Aug 2, 2023

benhoyt commented Aug 2, 2023

barrettj12 commented Aug 3, 2023

flotter commented Aug 3, 2023

barrettj12 commented Aug 4, 2023

servstate: remove on-check-failure test races #272

servstate: remove on-check-failure test races #272

Conversation

flotter commented Aug 2, 2023

benhoyt commented Aug 2, 2023

barrettj12 commented Aug 3, 2023

flotter commented Aug 3, 2023

barrettj12 commented Aug 4, 2023